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Markov chain Monte Carlo sampling methods often suffer from long 
correlation times. Consequently, these methods must be run for 
many steps to generate an independent sample. In this paper a 
method is proposed to overcome this difficulty. The method uti- 
lizes information from rapidly equilibrating coarse Markov chains 
that sample marginal distributions of the full system. This is accom- 
plished through exchanges between the full chain and the auxiliary 
coarse chains. Results of numerical tests on the bridge sampling 
and filtering/smoothing problems for a stochastic differential equa- 
tion are presented. 

Markov chain Monte Carlo | renormalization | multi-grid | filtering | param- 
eter estimation 

In order to understand the behavior of a physical system 
it is often necessary to generate samples from complicated 
high dimensional distributions. The usual tools for sampling 
from these distributions are Markov chain Monte Carlo meth- 
ods (MCMC) by which one constructs a Markov chain whose 
trajectory averages converge to averages with respect to the 
distribution of interest. For some simple systems it is pos- 
sible to construct Markov chains with independent values at 
each step. In general, however, spatial correlations in the sys- 
tem of interest result in long correlation times in the Markov 
chain and hence slow convergence of the chain's trajectory av- 
erages. In this paper, a method is proposed to alleviate the 
difficulties caused by spatial correlations in high dimensional 
systems. The method, parallel marginalization, is tested on 
two stochastic differential equation conditional path sampling 
problems. 

Parallel marginalization takes advantage of the shorter 
correlation lengths present in marginal distributions of the 
target density. Auxiliary Markov chains that sample approx- 
imate marginal distributions are evolved simultaneously with 
the Markov chain that samples the distribution of interest. 
By swapping their configurations, these auxiliary chains pass 
information between themselves and with the chain sampling 
the original distribution. As shown below, these swaps are 
made in a manner consistent with both the original distribu- 
tions and the approximate marginal distributions. The nu- 
merical examples indicate that improvement in efficiency of 
parallel marginalization over standard MCMC techniques can 
be significant. 

The design of efficient methods to approximate marginal 
distributions was addressed in [I] by Chorin and in ''(2^ by 
Stinis. The use of Monte Carlo updates on coarse subsets 
of variables is not a new concept (see [3] and the references 
therein). The method presented in [3] does not use marginal 
distributions. However, attempts have been made previously 
to use marginal distributions to accelerate the convergence of 
MCMC (see [H [5]). In contrast to parallel marginalization, 
the methods proposed in [4] and [5] do not preserve the distri- 
bution of the full system and therefore are not guaranteed to 
converge. The parallel construction used here is motivated by 



the parallel tempering method (see [6]), and allows efficient 
comparison of the auxiliary chains and the original chain. See 
references and [7] for expositions of standard MCMC meth- 
ods. 

Parallel marginalization for problems in Euclidean state 
spaces is described in detail in the next two sections. In 
the final sections the conditional path sampling problem is 
described and numerical results are presented for the bridge 
sampling and smoothing/filtering problems. 

Parallel Marginalization 

For the purposes of the discussion in this section, we assume 
that appropriate approximate marginal distributions are avail- 
able. As discussed in a later section, they may be provided 
by coarse models of the physical problem as in the examples 
below, or they may be calculated via the methods in [l] and 

Assume that the do dimensional system of interest has a 
probability density, 7ro(a;o), where xo G . Suppose further 
that, by the Metropolis-Hastings or any other method (see 
[B]), we can construct a Markov chain, G R'*° , which has tto 
as its stationary measure. That is, for two points 2:0,2/0 G R''" 

j To{xo ^ 2/o)7ro(a;o) dxo = 710(2/0) 

where ro(xo yo) is the probability density of a move to 
jy-ji+i _ ^^^1 gj-^gjj ^jja,t {Yo" = xq}. Here, n is the algorith- 
mic step. Under appropriate conditions (see 0), averages 
over a trajectory of {Y^} will converge to averages over ttq, 
i.e. for an objective function g{xo) 

^E^W)^E[3(Xa)] 

n = 

The size of the error in the above limit decreases as the rate 
of decay of the time autocorrelation 

corr [g{Y^),g{YS)] = 

E [(g (Fo") - E [g (Xo)]) (g {YS) - E [g {X,)])] 
Var [g (Xo)] 

increases. In this formula, Fo" is assumed to be drawn from 

TTO. 

It is well known that judicious elimination of variables by 
renormalization can reduce long range spatial correlations (see 
e.g. [8]). The variables are removed by averaging out their ef- 
fects on the full distribution. If the original density is ii{x,x) 
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and we wish to remove the x variables, the distribution of the 
remaining x variables is given by the marginal density (see 

mm) 

[1] 



TT (x) = TV (x, x) dx 



The full distribution can be factored as 

'k{x,x) = 7f(i:)7r(i:|a;) 

where 7r(a;|a;) is the conditional density of x given x. Because 
they exhibit shorter correlation lengths, the marginal distri- 
butions are useful in the acceleration of Markov chain Monte 
Carlo methods. 

With this in mind we consider a collection of lower dimen- 
sional Markov chains Y" G R**' which have stationary distri- 
butions ni{xi) where do > ■ ■ ■ > di. For each i < L let Ti be 
the transition probability density of Y", i.e. Ti{xi yi) is 
the probability density of {Yl^^^ = j/i} given that {1^" — Xi\. 
The {ni} are approximate marginal distributions. For exam- 
ple, divide the Xi variables into two subsets, Xi G R'*»+i and 
Xi G jjrfi-'^i+i ^ go that Xi — {xi,Xi). The Xi variables repre- 
sent the variables of Xi that are removed by marginalization, 
i.e. 

7r,+ i ^ J {xi,Xi) dxi. 

After arranging these chains in parallel we have the larger 
process 



The probability density of a move to |Y"+i = y| given that 
{Y" = x} for x,y eR''" X ■ ■ ■ xR''^ is given by 



T{x^y) = l[T,{x, 



[2] 



Since 



J ( T{x y)Y\'^i (3^0 j dxo . . . dxL = Y\.^^ ^y^) 



the stationary distribution of is 

Il(xo, . . . ,xl) = tto {xo) . . .ttl (xl) ■ 

The next step in the construction is to allow interactions 
between the chains {Y"} and to thereby pass information 
from the rapidly equilibrating chains on the lower dimensional 
spaces (large i) down to the chain on the original space (i — 0). 
This is accomplished by swap moves. In a swap move between 
levels i and i -I- 1, we take a di+i dimensional subset, Xi, of 
the Xi variables and exchange them with the Xi+i variables. 
The remaining di — di+i Xi variables are resampled from the 
conditional distribution tt^ {xi\xi+i). For the full chain, this 
swap takes the form of a move from {Y" = x} to |y"+^ — y| 
where 

X — (. . . . . . ) 



and 



y = {.. . ,x^+i,yi,Xi, ...). 



The ellipses represent components of that remain un- 
changed in the transition and iji is drawn from tt^ (xi\xij^i). 



If these swaps are undertaken unconditionally, the result- 
ing chain with equilibrate rapidly, but will not, in general, 
preserve the product distribution 11. To remedy this we in- 
troduce the swap acceptance probability 

A.^minll, |'(^'+i)-m(*0 



■Ki{Xi)-KiJ^l(Xi + i) j ^ ^ 

In this formula vfi is the function on R'^'+i resulting from 
marginalization of tt^ as in equation [T] Given that {Y" — x}, 
the probability density of |Y""+^ — jy}, after the proposal and 
either acceptance with probability Ai or rejection with prob- 
ability 1 — Ai, of a swap move, is given by 

Si {x ^y) ^ [1- Ai) 5{„=^} 



for x,y £ 



. 5 is the Dirac delta function. 



We have the following lemma. 

Lemma 1. The transition probabilities Si satisfy the detailed 
balance condition for the measure 11, i.e. 

U{x) S, {x^y) = n{y) S, {y ^ x) 

where x,y & R''^ x ■ ■ ■ x R''^ . 

The detailed balance condition stipulates that the probability 
of observing a transition x —> y is equal to that of observing 
a transition y ^ x and guarantees that the resulting Markov 
Chain preserves the distribution 11. Therefore, under general 
conditions, averages over a trajectory of {Y"} will converge 
to averages over 11. Since 



Tro{xo) = J U{xo, . . . ,xl) dxi 



. dXT 



we can calculate averages over no by taking averages over the 
trajectories of the first do components of Y". 

"Exact" approximation of acceptance probability 

Notice that the formula [3] for Ai requires the evaluation of tF; 
at the points Xi, Xi+i £ R'^'+i . While the approximation of Wi 
by functions on R'^'+i is in general a very difficult problem, 
its evaluation at a single point is often not terribly demand- 
ing. In fact, in many cases, including the examples in this 
paper, the Xi variables can be chosen so that the remaining 
Xi variables are conditionally independent given Xi. 

Despite these mitigating factors, the requirement that we 
evaluate Wi before we accept any swap is a little onerous. For- 
tunately, and somewhat surprisingly, this requirement is not 
necessary. In fact, standard strategies for approximating the 
point values of the marginals yield Markov chains that them- 
selves preserve the target measure. Thus even a poor estimate 
of the ratio appearing in [3] can give rise to a method that is 
exact in the sense that the resulting Markov chain will asymp- 
totically sample the target measure. 

To illustrate this point, we consider the following example 
of a swap move. Assume that the current position of the chain 
is {y" — x} where 



(• 



t'Z , Xi , Xi-^1 , 



The following steps will result in either |y"+i = a;} or 
= ^iiere 



and yi € 
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1. Let v° = Xi and let G R'*i-'*i+i for j = 1, . . . , M - 1 
be independent samples from pi{ ■ \xi), where Pi{ ■ \xi) 
is a reference density conditioned by Xi. For example, 
pi{ ■ \xi) could be a Gaussian approximation of Tn{xi\xi). 
How Pi is chosen depends on the problem at hand (see 
numerical examples below). In general Pi{ ■ \xi) should 
be easily evaluated and independently sampled, and it 
should "cover" 7ri( • \xi) in the sense that areas of R''* 
where TTi{ ■ \xi) is not negligible should be contained in 
areas where pi{ ■ \xi) is not negligible. 

2. Let e R'^i -<*'+! for j = 0, . . . , M - 1 be independent 
random variables sampled from ■ |a;i+i) (recall that 
we are considering a swap of Xi and Xi+i which live in 
the same space). Notice that the {it"'} variables depend 
on Xi+i while the {v-'} variables depend on Xi. 

3. Define the weights 

. TVi (xi,V^) . m (xi + l,U^) 

"■' — — ^ and wi — — ^ 



Piivi\xi) 



Pi (U^Xi+l) 



The choice of pi made above affects the variance of these 
weights, and therefore the variance of the acceptance 
probability below. 

4. Choose yi from among the {u-'} according to the multi- 
nomial distribution with probabilities 



1=0 < 



Notice that iji is an approximate sample from 
-Ki{ ■ |a;i+i). 



5. Set 



= {. . . ,Xi+i,yi,Xi 

with probability 



Ai — mm< 1, 



/ \ v^A/— 1 j 

7ri+i(xi+i)2^j.^Q wi 



[4] 



and 



y"+l ^ Y" = {. . . , Xi, Xi,Xi+i,. . . ) 

with probability 1 — Af'. 

The transition probability density for the above swap 
move from x ^ y for x,y € M.''" x ■ • ■ x R'^^is given by 

S"{x -> J/) = (1 - 7?) 5{j,=^} 

+ -R'^{(y.,!/,+i)=(-.+i,*0} n ^{y,=-,} 



where 
R = M 



j Pi{u\Xr+l) 



2^1=0 "'u 



X Pi{v^ \xi)pi{u' \xi+-i)dv^ du^ 



and 5 is again the Dirac delta function. In other words, 
dictates that the Markov chain accepts the swap with proba- 
bility R and rejects it with probability 1 — R. 

While the preceding swap move corresponds to a method 
for approximating the ratio 

Tti{Xi + l) 
T^i{Xi) 

appearing in the formula for Ai above, it also has some simi- 
larities with the multiple-try Metropolis method presented in 
|10l which uses multiple suggestion samples to improve ac- 
ceptance rates of standard MCMC methods. The following 
lemma is suggested by results in [10] . 

Lemma 2. The transition probabilities Sf' satisfy the detailed 
balance conditton for the measure 11. 

As before the detailed balance condition guarantees that av- 
erages over trajectories of the first do dimensions of Y" will 
converge to averages over ttq. 

The Af' contain an approximation to the ratio of 
marginals in [3] 



E 



j=0 



_1_ 



i (iij ) 
j (^;J I £ j ) 



a.s. 


■'r.(^,+ l,X,) l" 


M^oo 







mixi. 



TTiiXi) 

where Ep. denotes expectation with respect to the density pi. 
When < Ep, [wj | = Xi^j < 00, the convergence above 
follows from the strong law of large numbers and the fact that 



Ep 



TVi XilXi 



Pi 



TVi{xi,Xi) 



Pi{xi\xi) dxi 



PiiXi\Xi) - 

-^i) dXi — TTii^Xi 



For small values of M in|4l calculation of the swap acceptance 
probabilities is very cheap. However, higher values of M may 
improve the acceptance rates. For example, if the {TTij^^Q are 
exact marginals of ttq, then Ai = 1 while Af' < 1. Results 
similar to Lemma 2 hold when more general approximations 
replace the one given above; for example when the {u^} and 
are generated by a Metropolis-Hastings rule. In practice 
one has to balance the speed of evaluating Af for small M 
with the possible higher acceptance rates for M large. 

It is easy to see that a Markov chain which evolves only by 
swap moves will only sample a finite number of configurations. 
These swap moves must therefore be used in conjunction with 
a transition rule that can reach any region of space, such as T 
from expression [2l More precisely, T should be H-irreducible 
and aperiodic (see [11]). The the transition rule for parallel 
marginalization is 

P{x ^ y) ^ {1~ a) T{x y) 

+ a J T{x ^ z)S {z ^ y)dz 
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where 



and a £ [0, 1) is the probability that a swap move occurs. P 
dictates that, with probability q, the chain attempts a swap 
move between levels / and 1 + 1 where J is a random vari- 
able chosen uniformly from {0,...,L — 1}. Next, each level of 
the chain evolves independently according to the {Ti}. With 
probability 1 — a the chain does not attempt a swap move, but 
does evolve each level. The next result follows trivially from 
Lemma 2 and guarantees the invariance of 11 under evolution 
by P. 

Theorem 1. The transition probability P satisfies the detailed 
balance condition for the measure 11, i.e. 

n(x) P{x^y)^n{y) P{y^x) 

where x,y £ R"^" x ■ ■ ■ x R''". 

Thus by combining standard MCMC steps on each compo- 
nent governed by the transition probability T, with swap steps 
between the components governed by S, an MCMC method 
results which not only uses information from rapidly equili- 
brating lower dimensional chains, but is also convergent. 

Numerical example 1: bridge path sampling 

In the bridge path sampling problem we wish to approximate 
conditional expectations of the form 

E[g{Z^) \{Z''=z-},{Z^ = z+}] 

where s G (0, T) and {^*} is the real valued processes given 
by the solution of the stochastic differential equation 



dZ' = f{Z^)dt + a (Z*) dW' 



[5] 



g, f and a are real valued functions of R. Of course we can 
also consider functions g of more than one time. This problem 
arises, for example, in financial volatility estimation. Because 
in general we cannot sample paths of [5] we must first approx- 
imate {^'} by a discrete process for which the path density 
is readily available. Let to = 0,ti = ^,...,tK = T be a 
mesh on which we wish to calculate path averages. One such 
approximate process is given by the linearly implicit Euler 
scheme (a balanced implicit method, see [12]). 



X" 



X' 



The {C*} are independent Gaussian random variables with 
mean and variance 1, and A = AT is assumed to be 
a power of 2. The choice of this scheme over the Euler 
scheme (see [13] ) is due to its favorable stability properties 
as explained later. Without the condition X*^' = Z^ above, 
generating samples of is a relatively strait- 

forward endeavor. One simply generates a sample of Z^ , 
then evolves the system with this initial condition. However, 
the presence of information about {•^'Ij^.q complicates the 
task. In general, some sampling method which requires only 



knowlege of a function proportional to conditional density of 
(X'l , . . . , must be applied. The approximate path 

density associated with discretization [6] is 



where 



1 \x°,x''^) cx 



V{x,y) 



exp -^■l/(a;'^x'''-+^A) [7] 



[(l-A/ (x)) (y - x) + Af {x)]' 



2<j2 (x) A 

At this point we wish to apply the parallel marginalization 
sampling procedure to the density ttq. However, as indicated 
above, a prerequisite for the use of parallel marginalization is 
the ability to estimate marginal densities. In some important 
problems homogeneities in the underlying system yield simpli- 
fications in the calculation of these densities by the methods in 
[I][2]. These calculations can be carried out before implemen- 
tation of parallel marginalization, or they can be integrated 
into the sampling procedure. 

In some cases, the numerical estimation of the {i"i}i>o ^^'^ 
be completely avoided. The examples presented here are two 
such cases. Let Si = {O, 2\ . . . , K}. Decompose Si as Si U Si 
where 

5. = {0,2(2'),4(2'),...,7^} 

and _ 

S, = {2\3(2'),5(2'),...,i^-2'}. 

In the notation of the previous sections, Xi = {xi,Xi) where 
= {^MfegsA{o,if} and Xi = {xl''}^^g^. In words, the 
hat and tilde variables represent alternating time slices of the 
path. For all i fix = z~ and a;*^' = z'^ . We choose the 
approximate marginal densities 

where for each i, qi is defined by successive coarsenings of |6] 
That is, 



= exp - ^{^1^''"'^, 



Since tt; will be sampled using a Metropolis-Hastings method 
with a;" and a;*^ fixed, knowlege of the normalization con- 
stants 

Z,{x°,xl'') = j n 

keSi\{0,K} 

is unnecessary. 

Notice from [7] that, conditioned on the values of 
and x'^'+i, the variance of 2;''= is of order A. Thus any per- 
turbation of x^*' which leaves a;*^ fixed for j ^ k and which is 
compatible with joint distribution [7] must be of the order \/A. 
This suggests that distributions defined by coarser discretiza- 
tions of [7] will allow larger perturbations, and consequently 
will be easier to sample. However, it is important to choose a 
discretization that remains stable for large values of A. For 
example, while the linearly implicit Euler method performs 
well in the experiments below, similar tests using the Euler 



4 I www.pnas.org 



Footline Author 



method were less successful due to limitations on the largest 
allowable values of A. 

In this numerical example bridge paths are sampled be- 
tween time and time 10 for a diffusion in a double well 
potential 

f{x) = —Ax (xP' -~ l) and ct(x) = 1 

The left and right end points are chosen as z~ = z'^ — 0. 

g ]^io/(2'a)+i jg ^jjg -th jgygj of the parallel marginahza- 
tion Markov chain at algorithmic time n. There are 10 chains 
(L = 9 in expression [21) . The observed swap acceptance rates 
are reported in Table 1. Let K^i^j G R denote the midpoint 
of the path defined by Yq (i.e. an approximate sample of the 
path at time 5). In Fig. 1 the autocorrelation of Y^^^ 

COrr [ymidii^mid] 

is compared to that of a standard Metropolis-Hastings rule. 
In the figure, the time scale of the autocorrelation for the 
Metropolis-Hastings method has been scaled by a factor of 
1/10 to more than account for the extra computational time 
required per iteration of parallel marginalization. The relax- 
ation time of the parallel chain is clearly reduced. In these 
numerical examples, the algorithm in the previous section is 
applied with a slight simplification. First generate M indepen- 
dent Gaussian random paths {C-' (ife)}j,g5 with independent 
components C,-' (tk) of mean and variance 2*~^A. For each 
j and k £ Si let 



(tk) = C' (tk) + 0.5 (:] 



-fa;: 



i+i 



If in step 4, iji — u-' , then in step 1 we set v'^ — Xi and for 
each k £ Si 

p (M}^^^ ^{CMM + 0.5 !-)}^^^.,. 

All other steps remain the same. This change yields a slightly 
faster though less generally applicable swap step that also 
preserves the density H. Notice that this modification implies 
that the reference density pi is given by 



Pi ixi\xi) oc exp 



E 



-0.5 (x'i'-'+i^^'))' 



For this problem, the choice of M in |4l the number of sam- 
ples of {«"' } and {«"' }, seems to have little effect on the swap 
acceptance rates. In the numerical experiment M — i + 1 for 
swaps between levels i and i + 1. 

Numerical example 2: non-linear smoothing/filtering 

In the non-linear smoothing and filtering problem we wish to 
approximate conditional expectations of the form 



E 



where s G (0, T) and the real valued processes {^'} and {i?^} 
are given by the system 

dZ* = / (Z*) dt + a (Z*) dW\ 



g, /, a, and r are real valued functions of R. The {x"'} 
real valued independent random variable drawn from the den- 
sity jj, and are independent of the Brownian motion {W^'} . 
{sj} C {tj} , and = so < si < ... < sj = T. The process 
is a hidden signal and the {H-'} are noisy observations. 

Again, the system must first be discretized. The linearly 
implicit Euler scheme gives 

+ (X'^+i - X*") / {X^") A + a {X*") x/A C^ 
X = Z X ~ i.i.d. fi. 

The {C''} a-re independent Gaussian random variables with 
mean and variance 1, and A = ^. The {(,''} are indepen- 
dent of the {x^ }- K is again assumed to be a power of 2. 
The approximate path measure for this problem is 



(^x'°,...,x''< \h°,...,h^) 



OC 



K-1 

exp I - ^ V (x*^2;*'=+^A) 



fc=0 



The approximate marginals are chosen as 



I ft" , . . . , ft"'' I OC 



where V, qi and Si are as defined in the previous section. 

In this example, samples of the smoothed path are gener- 
ated between time time and time 10 for the same diffusion 
in a double well potential. The densities pi and p are chosen 
as 

p = N{0, 0.01) and p{x) oc exp (x^ - l)^^ 



The observation times are so = 0, si = 1, . . . , sio = 10 with 
= -1 for j = 0, . . . , 5 and = 1 for j = 6, . . . , 10. There 
are 8 chains (L = 7 in expression The observed swap 
acceptance rates are reported in Table 1. Again, Y^i^ G R 
denotes the midpoint of the path defined by Yf^ (i.e. an 
approximate sample of the path at time 5). In Fig. 2 the 
autocorrelation of YJ^i^ is compared to that of a standard 
Metropohs-Hastings rule. The figure has been adjusted as 
in the previous example. The relaxation time of the parallel 
chain is again clearly reduced. The algorithm is modified as in 
the previous example. For this problem, acceptable swap rates 
require a higher choice of M in |4] than needed in the bridge 
sampling problem. In this numerical experiment M — 2' for 
swaps between levels i and i -I- 1. 
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Conclusion 

A Markov chain Monte Carlo method has been proposed 
and apphed to two conditional path sampling problems for 
stochastic differential equations. Numerical results indicate 
that this method, parallel marginalization, can have a dramat- 
ically reduced equilibration time when compared to standard 
MCMC methods. 

Note that parallel marginalization should not be viewed as 
a stand alone method. Other acceleration techniques such as 
hybrid Monte Carlo can and should be implemented at each 
level within the parallel marginalization framework. As the 



smoothing problem indicates, the acceptance probabilities at 
coarser levels can become small. The remedy for this is the 
development of more accurate approximate marginal distribu- 
tions by, for example, the methods in [1] and [2]. 
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Fig. 1. Autocorrelations of parallel marginalization and standard Metropolis-Hastings methods for bridge sampling problem. 




Fig. 2. Autocorrelations of parallel marginalization and standard Metropolis-Hastings methods for filtering and smoothing problem. 



Table 1. Swap acceptance rates for bridge sampling and filtering/smoothing problems 



Levels* 


0/1 


1/2 


2/3 


3/4 


4/5 


5/6 


6/7 


7/8 


8/9 


BS+ 


0.86 


0.83 


0.75 


0.69 


0.54 


0.45 


0.30 


0.22 


0.26 




0.86 


0.83 


0.74 


0.65 


0.46 


0.23 


0.04 


NA 


NA 



*Swaps between levels i and i -j- 1 
"f^ Bridge sampling problem 
■f- Filtering/smoothing problem 
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